Data Engineering and Computer Science

Data engineering role is ensuring uninterrupted flow of data between servers and applications

Resources

https://github.com/ossu/computer-science
What is Data Engineering and Why Is It So Important?
ETL (extract, transform, load)
Have we bridged the gap between Data Science and DevOps?
Codelabs
- Google Developers Codelabs provide a guided, tutorial, hands-on coding experience. Most codelabs will step you through the process of building a small application, or adding a new feature to an existing application

Python

See AI/Data Engineering/Python

Julia

Javascript

Bash

free GNU/Linux Online Terminal and Programming IDE

CUDA

Books

See AI/Data Engineering/Python#Books

R

Courses

See AI/Data Engineering/Python#Courses
#COURSE Intro to Hadoop and MapReduce
#COURSE Mining Massive Data Sets (CS246 Stanford)
- https://lagunita.stanford.edu/courses/course-v1:ComputerScience+MMDS+SelfPaced/about
#COURSE Getting and Cleaning Data (Coursera)
SQL:
- Tutorial and exercises
- SQL (basic, intermediate, advanced / pet problems):
  - https://community.modeanalytics.com/sql/tutorial/introduction-to-sql/
  - https://github.com/FavioVazquez/ds-cheatsheets/tree/master/SQL

Code

See AI/Data Engineering/ML Ops
#CODE ABSL.flags - Defines a distributed command line system and manual argument parsing
- https://999999999.hatenablog.com/entry/argument_parse_with_abseil
- https://github.com/abseil/abseil-py/blob/main/smoke_tests/sample_app.py
#CODE Memray - Memray is a memory profiler for Python
- https://bloomberg.github.io/memray/
- https://www.bloomberg.com/company/stories/bloomberg-memray-open-source-profiler-python-code/
#CODE mmap.ninja - Memory mapped numpy arrays of varying shapes
- You can use mmap_ninja with any training framework (such as Tensorflow, PyTorch, MxNet), etc., as it stores your dataset as a memory-mapped numpy array
- A memory mapped file is a file that is physically present on disk in a way that the correlation between the file and the memory space permits applications to treat the mapped portions as if it were primary memory, allowing very fast I/O
#CODE Polars - Fast multi-threaded, hybrid-out-of-core DataFrame library in Rust | Python | Node.js
- https://www.pola.rs/benchmarks.html
- https://towardsdatascience.com/pandas-vs-polars-a-syntax-and-speed-comparison-5aa54e27497e
#CODE Pandas AI/Data Engineering/Pandas
#CODE Modin - Scale your pandas workflows by changing one line of code
#CODE Xarray AI/Data Engineering/Xarray
#CODE Dedupe - A python library for accurate and scaleable fuzzy matching, record deduplication and entity-resolution
- http://blog.districtdatalabs.com/basics-of-entity-resolution
#CODE PyTables
- http://www.pytables.org/
#CODE H5py
#CODE Singer - Simple, Composable Open Source ETL
#CODE Docker
- https://towardsdatascience.com/docker-for-data-science-4901f35d7cf9
#CODE Kubernetes - K8s is an open-source system for automating deployment, scaling, and management of containerized applications.

Subtopics

Open datasets (for ML, DL and DS)

See AI/Data Engineering/Open ML data

MLOps

See AI/Data Engineering/ML Ops

Feature engineering

https://en.wikipedia.org/wiki/Feature_engineering
Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. It is fundamental to the application of ML, and is both difficult and expensive. The need for manual feature engineering can be obviated by automated feature learning
http://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/
https://tech.zalando.com/blog/feature-extraction-science-or-engineering/

Feature extraction

See AI/Feature learning techniques in AI/Computer Vision/Computer Vision

Data mining

Web scraping

API

Databases

SQL

https://en.wikipedia.org/wiki/SQL
https://en.wikipedia.org/wiki/Relational_database
A relational database is a digital database whose organization is based on the relational model of data.
https://www.analyticsvidhya.com/blog/2017/01/46-questions-on-sql-to-test-a-data-science-professional-skilltest-solution/
Tutorial and exercises
SQL (basic, intermediate, advanced / pet problems)
List of SQL Commands
JOIN
- A SQL join clause combines columns from one or more tables in a relational database. It creates a set that can be saved as a table or used as it is. A JOIN is a means for combining columns from one (self-table) or more tables by using values common to each. ANSI-standard SQL specifies five types ofJOIN:INNER,LEFT OUTER,RIGHT OUTER,FULL OUTER and CROSS.
- https://periscopedata.com/blog//how-joins-work.html
https://www.digitalocean.com/community/tutorials/sqlite-vs-mysql-vs-postgresql-a-comparison-of-relational-database-management-systems
Python interface

NoSQL

https://en.wikipedia.org/wiki/NoSQL
Not only SQL: A NoSQL database provides a mechanism for storage and retrieval of data which is modeled in means other than the tabular relations used in relational databases. NoSQL databases are increasingly used in big data and real-time web applications. Many NoSQL stores compromise consistency (in the sense of theCAP theorem) in favor of availability, partition tolerance, and speed.
Column: Accumulo, Cassandra, Druid, HBase, Vertica, SAP HANA
#TALK GOTO 2012 - Introduction to NoSQL - Martin Fowler
Graph:
- A graph database is a database that uses graph structures for semantic queries with nodes, edges and properties to represent and store data. A key concept of the system is the graph (or edge or relationship), which directly relates data items in the store. The relationships allow data in the store to be linked together directly, and in many cases retrieved with a single operation.
- Graph databases employ nodes, edges and properties.
  - Nodes represent entities/items you might want to keep track of (people, businesses, accounts).
  - Edges, also known as graphs or relationships, are the lines that connect nodes to other nodes; they represent the relationship between them.
  - Properties are pertinent information that relate to nodes (sort of keywords).
  - AllegroGraph, ArangoDB, InfiniteGraph, Apache Giraph, MarkLogic, Neo4J, OrientDB, Virtuoso, Stardog
  - https://neo4j.com/developer/graph-database/
Key-value
- https://en.wikipedia.org/wiki/Key-value_database
- A key-value store, or key-value database, is a data storage paradigm designed for storing, retrieving, and managing associative arrays, a data structure more commonly known today as a dictionary or hash.
- Dictionaries contain a collection of objects, or records, which in turn have many different fields within them, each containing data. These records are stored and retrieved using a key that uniquely identifies the record, and is used to quickly find the data within the database.
Document-oriented database

Data munging

https://www.coursera.org/learn/data-cleaning

Data preparation

Exploratory data analysis

Big data

MapReduce

https://en.wikipedia.org/wiki/MapReduce
MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster
A MapReduce program is composed of aMap() procedure (method) that performs filtering and sorting (such as sorting students by first name into queues, one queue for each name) and aReduce() method that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies)

Resources

Python

Julia

Javascript

Bash

CUDA

Books

R

Courses

Code

Business Intelligence

Big data, distributed computing

Databases

Subtopics

Open datasets (for ML, DL and DS)

MLOps

Feature engineering

Feature extraction

Data mining

Web scraping

API

Databases

SQL

NoSQL

Data munging

Data preparation

Exploratory data analysis

Big data

MapReduce